An informal explanation of regexps

A regular expression is a string of characters in which a few characters have special meaning. For instance, the dot (.) stands for any single character, the star (*) means zero or more repetitions of the preceding character, the vertical bar (|) is a logical "or" between two alternatives, and parentheses group characters together. To use one of these characters without its special meaning, precede it with a backslash (\).
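As a rough illustration of these special characters, here is a minimal sketch using Python's re module. Python is only a stand-in chosen for this example (it may differ in details from the regexp engine described here), and the names EXAMPLES and whole_match are hypothetical.

```python
import re

# Each entry: (pattern, a string it should accept, a string it should reject).
EXAMPLES = [
    (r"b.t", "bat", "boot"),        # dot: exactly one arbitrary character
    (r"ab*c", "abbbc", "adc"),      # star: zero or more of the preceding character
    (r"cat|dog", "dog", "cow"),     # bar: either alternative
    (r"(ab)*", "ababab", "aba"),    # parentheses: the star applies to the group
    (r"a\.b", "a.b", "axb"),        # backslash: the dot loses its special meaning
]


def whole_match(pattern: str, text: str) -> bool:
    """Return True when the entire text is a match for the pattern."""
    return re.fullmatch(pattern, text) is not None


# Collect one (accepted?, rejected?) pair per example.
results = [
    (whole_match(pattern, good), whole_match(pattern, bad))
    for pattern, good, bad in EXAMPLES
]
```

Every pair in results should come out as (True, False): the first string is accepted and the second is rejected.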

A regular expression is in fact a condensed way of representing a set of strings. Here are a few common examples:

I hope you get the picture. A precise definition follows.

A precise definition of regexps

A regular expression (sometimes called regexp or pattern) is a condensed representation of a set of strings. The strings in this set are said to be matches for the regular expression; others are said to be rejected by the regexp. For instance, the regular expression \.c$ matches all strings ending in .c.
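The idea of a regexp splitting strings into matches and rejects can be sketched as follows. This uses Python's re module as an assumed stand-in for the engine described here, and the file names are made up for the example.

```python
import re

# Hypothetical file names, used only for illustration.
names = ["main.c", "main.h", "abc", "util.cc", "x.c"]

# \.c$ : a literal dot, then "c", then the end of the string.
pattern = re.compile(r"\.c$")

# search() looks for a match anywhere in the string; the $ pins it to the end.
accepted = [name for name in names if pattern.search(name)]
rejected = [name for name in names if not pattern.search(name)]
```

Here accepted holds "main.c" and "x.c". Note that "abc" is rejected because the backslash makes the dot literal, so the character before the final c must really be a dot.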

A regular expression is zero or more branches, separated by |. It matches anything that matches one of the branches.

A branch is zero or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.
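One consequence of these two rules is that | splits the whole expression into branches before any concatenation is considered. A small sketch, again assuming Python's re module as a stand-in (whole_match is a hypothetical helper):

```python
import re


def whole_match(pattern: str, text: str) -> bool:
    """Return True when the entire text is a match for the pattern."""
    return re.fullmatch(pattern, text) is not None


candidates = ["ab", "cd", "abd", "acd"]

# Two branches, "ab" and "cd"; each branch is two single-character pieces.
two_branches = [s for s in candidates if whole_match(r"ab|cd", s)]

# One branch of three pieces; the middle piece is a parenthesised regexp
# that itself has two branches, "b" and "c".
one_branch = [s for s in candidates if whole_match(r"a(b|c)d", s)]
```

The first pattern accepts only "ab" and "cd"; the second accepts only "abd" and "acd".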

A piece is an atom possibly followed by *, +, or ?. An atom followed by * matches a sequence of 0 or more matches of the atom. An atom followed by + matches a sequence of 1 or more matches of the atom. An atom followed by ? matches a match of the atom, or the null string.
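The three suffixes can be compared side by side. The following is a minimal sketch assuming Python's re module, whose *, + and ? behave as described above; the helper accepts is hypothetical.

```python
import re


def accepts(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


candidates = ["a", "ab", "abb"]

star = accepts(r"ab*", candidates)      # "a" followed by 0 or more "b"
plus = accepts(r"ab+", candidates)      # "a" followed by 1 or more "b"
optional = accepts(r"ab?", candidates)  # "a" followed by "b" or by nothing

# With a parenthesised atom, ? also accepts the null (empty) string.
group_optional = accepts(r"(ab)?", ["", "ab", "abab"])
```

In each pattern the suffix applies only to the atom immediately before it, which is the single character b in the first three cases and the group (ab) in the last.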

An atom is a regular expression in parentheses (matching a match for the regular expression), a range (see below), . (matching any single character), ^ (matching the null string at the beginning of the input string), $ (matching the null string at the end of the input string), a \ followed by a single character (matching that character), or a single character with no other significance (matching that character).
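The anchors and the backslash escape can be illustrated with a short sketch. It assumes Python's re module as a stand-in; one known difference is that in Python a backslash followed by a letter or digit can have its own meaning, so the sketch only escapes punctuation. The helper found is hypothetical.

```python
import re


def found(pattern: str, text: str) -> bool:
    """Return True when the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None


checks = {
    "unanchored": found(r"ab", "cabc"),           # "ab" may appear anywhere
    "start_ok": found(r"^ab", "abc"),             # ^ pins the match to the start
    "start_fail": found(r"^ab", "cab"),
    "end_ok": found(r"ab$", "cab"),               # $ pins the match to the end
    "end_fail": found(r"ab$", "abc"),
    "literal_dollar": found(r"\$5", "costs $5"),  # \$ is an ordinary dollar sign
    "literal_star": found(r"a\*b", "a*b"),        # \* is an ordinary star
    "star_not_literal": found(r"a\*b", "aab"),
}
```

The entries ending in _fail and the star_not_literal entry should be False; all others should be True.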

A range is a sequence of characters enclosed in []. It normally matches any single character from the sequence. If the sequence begins with ^, it matches any single character not from the rest of the sequence. If two characters in the sequence are separated by -, this is shorthand for the full list of ASCII characters between them (e.g. [0-9] matches any decimal digit). To include a literal ] in the sequence, make it the first character (following a possible ^). To include a literal -, make it the first or last character.
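The placement rules for ^, ] and - inside a range are easy to get wrong, so here is a small sketch of each case. It assumes Python's re module, which follows the same placement rules for these characters; the helper one_char is hypothetical.

```python
import re


def one_char(pattern: str, char: str) -> bool:
    """Return True when the range pattern matches the single character."""
    return re.fullmatch(pattern, char) is not None


range_checks = [
    one_char(r"[0-9]", "7"),       # digit range accepts a digit
    not one_char(r"[0-9]", "x"),   # and rejects a letter
    one_char(r"[^0-9]", "x"),      # leading ^ inverts the range
    not one_char(r"[^0-9]", "7"),
    one_char(r"[]x]", "]"),        # ] placed first is a literal ]
    one_char(r"[]x]", "x"),
    not one_char(r"[^]x]", "]"),   # ] right after ^ is also literal
    one_char(r"[^]x]", "y"),
    one_char(r"[a-]", "-"),        # - placed last is a literal -
    one_char(r"[-a]", "-"),        # - placed first is a literal -
    not one_char(r"[a-]", "b"),    # so no a-to-something range is formed
]
```

Every entry in range_checks should be True.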

